In this Lab session, we will focus on Machine Learning (ML), as introduced in Lecture 5
We will review examples of both supervised and unsupervised ML algorithms
Supervised ML algorithms examples
Logistic regression
🌳 Random Forest / decision trees 🌲
Unsupervised ML algorithms examples
K-means Clustering
PCA for dimension reduction
(optional) PLS-DA for classification, a supervised ML alternative to PCA
🟠 ACKNOWLEDGEMENTS
The examples and datasets in this Lab session follow very closely two sources:
The tutorial on “Principal Component Analysis (PCA) in R” by: Statistics Globe
R ENVIRONMENT SET UP & DATA
Needed R Packages
We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
We may also use the packages below (specifying package::function for clarity).
# Load pckgs for this R session# --- General library(here) # tools find your project's files, based on working directory
here() starts at /Users/luisamimmi/Github/R4stats
library(dplyr) # A Grammar of Data Manipulation
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(skimr) # Compact and Flexible Summaries of Datalibrary(magrittr) # A Forward-Pipe Operator for R library(readr) # A Forward-Pipe Operator for R library(tidyr) # Tidy Messy Data
Attaching package: 'tidyr'
The following object is masked from 'package:magrittr':
extract
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
# ---Plotting & data visualizationlibrary(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphicslibrary(ggfortify) # Data Visualization Tools for Statistical Analysis Resultslibrary(scatterplot3d) # 3D Scatter Plot# --- Statisticslibrary(MASS) # Support Functions and Datasets for Venables and Ripley's MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
library(factoextra) # Extract and Visualize the Results of Multivariate Data Analyses
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR) # Multivariate Exploratory Data Analysis and Data Mininglibrary(rstatix) # Pipe-Friendly Framework for Basic Statistical Tests
Attaching package: 'rstatix'
The following object is masked from 'package:MASS':
select
The following object is masked from 'package:stats':
filter
library(car) # Companion to Applied Regression
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
library(ROCR) # Visualizing the Performance of Scoring Classifiers# --- Tidymodels (meta package)library(rsample) # General Resampling Infrastructure library(broom) # Convert Statistical Objects into Tidy Tibbles
a few clean datasets used in the “Core Statistics using R” course by: Martin van Rongen
Dataset on Breast Cancer Biopsy
Name: Biopsy Data on Breast Cancer Patients Documentation: See reference on the data downloaded and conditioned for R here https://cran.r-project.org/web/packages/MASS/MASS.pdf Sampling details: This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. The dataset contains the original Wisconsin breast cancer data with 699 observations on 11 variables.
Importing Dataset biopsy
The data can be interactively obtained form the MASS R package
# (after loading pckg)# library(MASS) # I can call utils::data(biopsy)
biopsy variables with description
Variable
Type
Description
ID
character
Sample ID
V1
integer 1 - 10
clump thickness
V2
integer 1 - 10
uniformity of cell size
V3
integer 1 - 10
uniformity of cell shape
V4
integer 1 - 10
marginal adhesion
V5
integer 1 - 10
single epithelial cell size
V6
integer 1 - 10
bare nuclei (16 values are missing)
V7
integer 1 - 10
bland chromatin
V8
integer 1 - 10
normal nucleoli
V9
integer 1 - 10
mitoses
class
factor
benign or malignant
biopsy variables exploration
The biopsy data contains 699 observations of 9 continuous variables, V1, V2, …, V9.
The dataset also contains a character variable: ID, and a factor variable: class, with two levels (“benign” and “malignant”).
There is one incomplete variable V6 = “bare nuclei” with 16 missing values.
remember the package skimr for exploring a dataframe?
# check if vars have missing valuesbiopsy %>%# select only variables starting with "V" skimr::skim(starts_with("V")) %>% dplyr::select(skim_variable, n_missing)
LOGISTIC REGRESSION: 1 EXAMPLE of SUPERVISED ML ALGORITHM
Logistic Regr.: review
Logistic regression is used to model a binary response variable, e.g. yes|no, or benign|malignant in the biopsy dataset.
Logistic regression is a type of Generalized Linear Model (GLM): is a more flexible version of linear regression that can work also for categorical response variables (e.g. logistic regression) or count data (e.g. poisson regression).
GLMs (introduced in 1972) provide a unified framework to accommodate response variables that come from a wide range of distributions:
the error distribution (or family) of the response variable can be a normal distribution (continuous response), but also a binomial distribution (binary response), a Poisson distribution (count data), etc.
the link function is used to model the relationship between the predictors and the response variable. For example, in logistic regression, the link function is the logit function.
Logistic Regr.: logit function
If we have predictor variables like \(x_{1,i}\), \(x_{2,i}\), …, \(x_{k,i}\) and a binary response variable \(y_i\) (where \(y_i = 0\) or \(y_i = 1\)), we need a “special” function (or link function) able to transform the expected value of the response variable into the outcome we’re trying to predict. So:
If \(p_i\) is the probability that \(y_i = 1\)… \[
logit (p_i) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_k x_{k,i}
\]
R fits the coefficients \(\beta_0\), \(\beta_1\), …, \(\beta_k\) to the data using the maximum likelihood estimation method.
Biopsy dataset exploration
Biopsied cells of 700 breast cancer tumors, used to determine if the tumors were benign or malignant.
This determination was based on 9 characteristics of the cells, ranked from 1(benign) to 10(malignant):
1) Clump Thickness – How the cells aggregate. If monolayered they are benign and if clumped on top of each other they are malignant
2) Uniform Size – All cells of the same type should be the same size.
3) Uniform Shape If cells vary in cell shape they could be malignant
4) Marginal Adhesion – Healthy cells have a strong ability to stick together whereas cancerous cells do not
5) Single Epithelial Size – If epithelial cells are not equal in size, it could be a sign of cancer
6) Bare nuclei – If the nucleus of the cell is not surrounded by cytoplasm, the cell could be malignant
7) Bland Chromatin – If the chromatin’s texture is coarse the cell could be malignant
8) Normal Nucleoli – In a healthy cell the nucleoli is small and hard detect via imagery. Enlarged nucleoli could be a sign of cancer
9) Mitosis – cells that multiply at an uncontrollable rate could be malignant
Biopsy dataset cleaning
ID is the patient ID to ensure anonymity (not used in the analysis)
Class represents the diagnosis, either benign or malignant, of the patient.
Our goal is to understand which variables affect the malignancy of a breast cancer tumor.
This will help to predict a diagnosis with the least amount of false positives and false negatives.
# remove rows with missing valuesbiopsy_clean =na.omit(biopsy)# rename the columnscolnames(biopsy_clean) <-c("ID","Clump_Thickness", "Uniform_Size", "Uniform_Shape","Marginal_Adhesion", "Single_Epith_Size", "Bare_Nuclei","Bland_Chromatin", "Normal_Nuclei","Mitosis", "Class")# check the structure of the datasetpaint::paint(biopsy_clean)
data.frame [683, 11]
ID chr 1000025 1002945 1015425 1016277 10170~
Clump_Thickness int 5 5 3 6 4 8
Uniform_Size int 1 4 1 8 1 10
Uniform_Shape int 1 4 1 8 1 10
Marginal_Adhesion int 1 5 1 1 3 8
Single_Epith_Size int 2 7 2 3 2 7
Bare_Nuclei int 1 10 2 4 1 10
Bland_Chromatin int 3 3 3 3 3 9
Normal_Nuclei int 1 2 1 7 1 7
Mitosis int 1 1 1 1 1 1
Class fct benign benign benign benign benign ma~
# remove the ID columnbiopsy_clean$ID =NULL
Biopsy sample splitting
In ML, it is good practice to split the data into training and testing sets.
We will use the training set to fit the model and the testing set to evaluate the model’s performance.
Figure 1: Boxplot of the independent variables - each of the 9 variables is plotted against the class of the tumor - values are consitently higher & more dispersed for malignant tumors
Logistic Regr.: model fitting
We fit a logistic regression model to the biopsy_train data using:
the glm function with argument family = binomial to specify the logistic regression model;
and with Class ~ . to specify an initial model that uses all the variables as predictors (backward elimination approach).
# Building initial model model = stats::glm(Class ~ . , family = binomial, data=biopsy_train)
Table 1 shows the model summary, with the coefficient estimate for each predictor.
the broom::tidy function converts the model summary into a data frame.
Table 1: Complete logistic regression model + coefficients are in the form of natural logarithm of the odds of the event happening + positive estimate indicates an increase in the odds of finding a malignant tumor
term
estimate
std.error
statistic
p.value
signif. lev.
(Intercept)
-9.8386
1.2709
-7.7414
0.0000
***
Clump_Thickness
0.3935
0.1583
2.4857
0.0129
*
Uniform_Size
0.1624
0.2417
0.6719
0.5017
Uniform_Shape
0.2693
0.2509
1.0735
0.2830
Marginal_Adhesion
0.3163
0.1412
2.2405
0.0251
*
Single_Epith_Size
0.0414
0.2001
0.2066
0.8363
Bare_Nuclei
0.4122
0.1060
3.8877
0.0001
***
Bland_Chromatin
0.5573
0.2105
2.6472
0.0081
**
Normal_Nuclei
0.1498
0.1275
1.1745
0.2402
Mitosis
0.5065
0.3474
1.4578
0.1449
Logistic Regr.: coefficients’ interpretation
As discussed before, in logistic regression, the coefficients are in the form of the natural logarithm of the odds of the response event happening (i.e. \(Y_i = 1\)):
However, with some algebraic transformation, the logit function can be inverted to obtain the probability of the response event happening as a function of the predictors:
We can use a statistic called the Akaike Information Criterion (AIC) to compare models.
In AIC, a penalty is given for including additional variables.
The model with the lowest AIC is considered the best model.
Try fitting more parsimonious models by removing variables that are not significant.
# For example let's fit a model without the variable `Uniform_Size`model2 =glm(Class~ .-Uniform_Size, family = binomial, data=biopsy_train)# Compare the AIC values of the 2 modelstibble(Model =c("model", "model2"), AIC =c(AIC(model), AIC(model2) ))
# A tibble: 2 × 2
Model AIC
<chr> <dbl>
1 model 100.
2 model2 98.5
According to the AIC values, the model2 seems better (AIC is lower).
The MASS package’s function stepAIC enables to perform a systematic model selection (by AIC):
The direction argument specifies the direction of the stepwise search.
The trace argument (if set to TRUE) prints out all the steps.
The best_model has removed these variables:
Uniform_Shape
Single_Epith_Size
Normal_Nuclei
The best_model has the lowest AIC value (from 100 to 98.5), despite a higher Residual Deviance than the full model (from 80 to 80.6), albeit by a very slight amount.
# Select the best model based on AICbest_model <- MASS::stepAIC(model, direction ="both", trace =FALSE)# Compare the AIC values of full and best modeltibble(Model =c("model", "best_model"), AIC =c(AIC(model), AIC(model2)),Deviance =c(deviance(model), deviance(best_model))) %>%kable()
Model
AIC
Deviance
model
100.00985
80.00985
best_model
98.49005
80.60701
Logistic Regr.: model evaluation on test data
We can use the predict function to predict the class of the test data using the best model.
Then we use the ROCR package evaluate and visualize the classification
the performance function calculates the True Positive Rate (TPR) and False Positive Rate (FPR) for various thresholds. From it we can get:
tpr = True Positive Rate (TPR) is the proportion of actual positive cases that are correctly predicted as positive.
fpr = False Positive Rate (FPR) is the proportion of actual negative cases that are incorrectly predicted as positive.
thresholds = the list of thresholds used to calculate the TPR and FPR.
# Fitted value for the test data 205 samples based on modelpred = stats::predict(best_model, biopsy_test, type ="response")# Create a prediction object for the ROCR packageROCRPRed <- ROCR::prediction(predictions = pred, labels = biopsy_test$Class)# Create a performance object (True Positive Rate and False Positive Rate) for various thresholds.ROCRPerf <- ROCR::performance(ROCRPRed, "tpr", "fpr")# Extract TPR and FPR datafpr <- ROCRPerf@x.values[[1]] # xtpr <- ROCRPerf@y.values[[1]] # y# List of thresholds used to calculate the TPR and FPR in the performance objectthresholds <- ROCRPerf@alpha.values[[1]]
🙄🤔 Logistic Regr.: visualization of model performance at varying thresholds
# Create a data frame for ggplot2perf_data <-data.frame(FPR = fpr, TPR = tpr, Threshold = thresholds)# Plot using ggplot2library(ggplot2)ggplot(perf_data, aes(x = FPR, y = TPR, color = Threshold)) +geom_line(linewidth =1) +scale_color_viridis_c(option ="C") +# Optional: better color scalelabs(title ="TPR vs FPR",x ="False Positive Rate [0, 0.1]",y ="True Positive Rate [0.8, 1]",color ="Threshold ") +coord_cartesian(xlim =c(0, 0.1), ylim =c(0.8, 1)) +theme_minimal()
The practical use of this is to identify a threshold where the True Positive Rate (TPR) is acceptably high, without a significant increase in False Positive Rate (FPR).
🙄🤔 Logistic Regr.: visualization of model performance at varying thresholds
Threshold color gradient: YELLOW = the model predicts positive only for very confident cases, PURPLE: the model predicts positive for many cases, even with lower confidence.
Logistic Regr.: confusion matrix
We create a confusion matrix with different acceptance rates so we can see how it affects the true and false positivity rates.
Since our data involves diagnosing malignant tumors, it is important to keep the false negative rate low as this would be telling someone who has a malignant tumor that it is benign.
We found that a cutoff of 0.4 gives a good balance of low false negatives while still maintaining a high true positive rate.
predicted.test <-predict(best_model, biopsy_test, type ="response")predicted.data <-data.frame(prob.of.malig=predicted.test, malig = biopsy_test$Class)predicted.data <- predicted.data[order(predicted.data$prob.of.malig, decreasing = F),]predicted.data$rank <-1:nrow(predicted.data)plot_ROC <-ggplot(data=predicted.data, aes(x=rank, y=prob.of.malig)) +geom_point(aes(color=malig), alpha=1, shape=4, stroke=2) +xlab("Index") +ylab("Predicted Probability of Tumor Being Malignant") plot_ROC
Logistic Regr.: ROC curve
This shows why lowering the cutoff improves the accuracy of the model as some malignant tumors are being underestimated which would cause false negatives.
Logistic Regr.: conclusions
Uniform Size and Single Epithithial Size were not significant in predicting the malignancy of tumor cells so our model does not include these variables.
Our fitted model reduces the null deviance and AIC without impacting the residual deviance by a significant amount and is able to predict the testing dataset with >90% accuracy.
For further analysis, we could run the model multiple times because our original and revised model are similar.
New training and testing data would help confirm our results and help identify possible overfitting.
…
🟠 K-MEANS CLUSTERING: EXAMPLE of UNSUPERVISED ML ALGORITHM
…
PCA: EXAMPLE of UNSUPERVISED ML ALGORITHM
Reducing high-dimensional data to a lower number of variables
biopsy dataset manipulation
We will:
exclude the non-numerical variables (ID and class) before conducting the PCA.
exclude the individuals with missing values using the na.omit() or filter(complete.cases() functions.
We can do both in 2 equivalent ways:
with base R (more compact)
# new (manipulated) dataset data_biopsy <-na.omit(biopsy[,-c(1,11)])
with dplyr (more explicit)
# new (manipulated) dataset data_biopsy <- biopsy %>%# drop incomplete & non-integer columns dplyr::select(-ID, -class) %>%# drop incomplete observations (rows) dplyr::filter(complete.cases(.))
biopsy dataset manipulation
We obtained a new dataset with 9 variables and 683 observations (instead of the original 699).
The first step of PCA is to calculate the principal components. To accomplish this, we use the prcomp() function from the stats package.
With argument “scale = TRUE” each variable in the biopsy data is scaled to have a mean of 0 and a standard deviation of 1 before calculating the principal components (just like option Autoscaling in MetaboAnalyst)
# calculate principal componentbiopsy_pca <-prcomp(data_biopsy, # standardize variablesscale =TRUE)
Analyze Principal Components
Let’s check out the elements of our obtained biopsy_pca object
(All accessible via the $ operator)
names(biopsy_pca)
[1] "sdev" "rotation" "center" "scale" "x"
“sdev” = the standard deviation of the principal components
“sdev”^2 = the variance of the principal components (eigenvalues of the covariance/correlation matrix)
“rotation” = the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).
“center” and “scale” = the means and standard deviations of the original variables before the transformation;
“x” = the principal component scores (after PCA the observations are expressed in principal component scores)
Analyze Principal Components (cont.)
We can see the summary of the analysis using the summary() function
The first row gives the Standard deviation of each component, which can also be retrieved via biopsy_pca$sdev.
The second row shows the Proportion of Variance, i.e. the percentage of explained variance.
The output suggests the 1st principal component explains around 65% of the total variance, the 2nd principal component explains about 9% of the variance, and this goes on with diminishing proportion for each component.
Cumulative Proportion of variance for components
The last row from the summary(biopsy_pca), shows the Cumulative Proportion of variance, which calculates the cumulative sum of the Proportion of Variance.
# Extracting Cumulative Proportion from summarysummary(biopsy_pca)$importance[3,]
Once you computed the PCA in R you must decide the number of components to retain based on the obtained results.
VISUALIZING PCA OUTPUTS
Scree plot
There are several ways to decide on the number of components to retain.
One helpful option is visualizing the percentage of explained variance per principal component via a scree plot.
Plotting with the fviz_eig() function from the factoextra package
# Scree plot shows the variance of each principal component factoextra::fviz_eig(biopsy_pca, addlabels =TRUE, ylim =c(0, 70))
Visualization is essential in the interpretation of PCA results. Based on the number of retained principal components, which is usually the first few, the observations expressed in component scores can be plotted in several ways.
Scree plot
The obtained scree plot simply visualizes the output of summary(biopsy_pca).
Principal Component Scores
After a PCA, the observations are expressed as principal component scores.
We can retrieve the principal component scores for each Variable by calling biopsy_pca$x, and store them in a new dataframe PC_scores.
Next we draw a scatterplot of the observations – expressed in terms of principal components
# Create new object with PC_scoresPC_scores <-as.data.frame(biopsy_pca$x)head(PC_scores)
It is also important to visualize the observations along the new axes (principal components) to interpret the relations in the dataset:
Principal Component Scores plot (adding label variable)
When data includes a factor variable, like in our case, it may be interesting to show the grouping on the plot as well.
In such cases, the label variable class can be added to the PC set as follows.
# retrieve class variablebiopsy_no_na <-na.omit(biopsy)# adding class grouping variable to PC_scoresPC_scores$Label <- biopsy_no_na$class
The visualization of the observation points (point cloud) could be in 2D or 3D.
Principal Component Scores plot (2D)
The Scores Plot can be visualized via the ggplot2 package.
grouping is indicated by argument the color = Label;
geom_point() is used for the point cloud.
ggplot(PC_scores, aes(x = PC1, y = PC2, color = Label)) +geom_point() +scale_color_manual(values=c("#245048", "#CC0066")) +ggtitle("Figure 1: Scores Plot") +theme_bw()
Principal Component Scores plot (2D)
Figure 1 shows the observations projected into the new data space made up of principal components
Principal Component Scores (2D Ellipse Plot)
Confidence ellipses can also be added to a grouped scatter plot visualized after a PCA. We use the ggplot2 package.
grouping is indicated by argument the color = Label;
geom_point() is used for the point cloud;
the stat_ellipse() function is called to add the ellipses per biopsy group.
ggplot(PC_scores, aes(x = PC1, y = PC2, color = Label)) +geom_point() +scale_color_manual(values=c("#245048", "#CC0066")) +stat_ellipse() +ggtitle("Figure 2: Ellipse Plot") +theme_bw()
Principal Component Scores (2D Ellipse Plot)
Figure 2 shows the observations projected into the new data space made up of principal components, with 95% confidence regions displayed.
Principal Component Scores plot (3D)
A 3D scatterplot of observations shows the first 3 principal components’ scores.
For this one, we need the scatterplot3d() function of the scatterplot3d package;
The color argument assigned to the Label variable;
To add a legend, we use the legend() function and specify its coordinates via the xyz.convert() function.
# 3D scatterplot ...plot_3d <-with(PC_scores, scatterplot3d::scatterplot3d(PC_scores$PC1, PC_scores$PC2, PC_scores$PC3, color =as.numeric(Label), pch =19, main ="Figure 3: 3D Scatter Plot", xlab="PC1",ylab="PC2",zlab="PC3"))# ... + legendlegend(plot_3d$xyz.convert(0.5, 0.7, 0.5), pch =19, yjust=-0.6,xjust=-0.9,legend =levels(PC_scores$Label), col =seq_along(levels(PC_scores$Label)))
Principal Component Scores plot (3D)
Figure 3 shows the observations projected into the new 3D data space made up of principal components.
Biplot: principal components v. original variables
Next, we create another special type of scatterplot (a biplot) to understand the relationship between the principal components and the original variables.
In the biplot each of the observations is projected onto a scatterplot that uses the first and second principal components as the axes.
For this plot, we use the fviz_pca_biplot() function from the factoextra package
We will specify the color for the variables, or rather, for the “loading vectors”
The habillage argument allows to highlight with color the grouping by class
Biplot: principal components v. original variables
The axes show the principal component scores, and the vectors are the loading vectors
Interpreting biplot output
Biplots have two key elements: scores (the 2 axes) and loadings (the vectors). As in the scores plot, each point represents an observation projected in the space of principal components where:
Biopsies of the same class are located closer to each other, which indicates that they have similar scores referred to the 2 main principal components;
The loading vectors show strength and direction of association of original variables with new PC variables.
As expected from PCA, the single PC1 accounts for variance in almost all original variables, while V9 has the major projection along PC2.
Interpreting biplot output (cont.)
scores <- biopsy_pca$xloadings <- biopsy_pca$rotation# excerpt of first 2 componentsloadings[ ,1:2]
Motivated the choice of learning/using R for scientific quantitative analysis, and lay out some fundamental concepts in biostatistics with concrete R coding examples.
Consolidated understanding of inferential statistic, through R coding examples conducted on real biostatistics research data.
Discussed the relationship between any two variables, and introduce a widely used analytical tool: regression.
Presented a popular ML technique for dimensionality reduction (PCA), performed both with MetaboAnalyst and R.
Introduction to power analysis to define the correct sample size for hypotheses testing and discussion of how ML approaches deal with available data.
Final thoughts
While the workshop only allowed for a synthetic overview of fundamental ideas, it hopefully provided a solid foundation on the most common statistical analysis you will likely run in your daily work:
Thorough understanding of the input data and the data collection process
Univariate and bivariate exploratory analysis (accompanied by visual intuition) to form hypothesis
Upon verifying the assumptions, we fit data to hypothesized model(s)
Assessment of the model performance (\(R^2\), \(Adj. R^2\), \(F-Statistic\), etc.)
You should now have a solid grasp on the R language to keep using and exploring the huge potential of this programming ecosystem
We only scratched the surface in terms of ML classification and prediction models, but we got a hang of the fundamental steps and some useful tools that might serve us also in more advanced analysis